Fix multi-coordinate indexes dropped in _replace_maybe_drop_dims #11286

benbovy merged 5 commits into pydata:main
…_copy_listed
When a custom Index spans multiple coordinates across different dimensions
(e.g. a UGRID topology index covering both `nodes` and `faces`), it was
incorrectly dropped during:
- DataArray dimension reduction (e.g. `.mean("extra_dim")`) via
`_replace_maybe_drop_dims`, which filtered coords by strict dim subset
without consulting `Index.should_add_coord_to_array()`
- Dataset subsetting by variable names (e.g. `ds[["node_data"]]`) via
`_copy_listed`, with the same issue
Both paths now use `should_add_coord_to_array()` for index-backed
coordinates, consistent with how `_construct_dataarray` already works.
Fixes pydata#11215
Co-authored-by: Claude <noreply@anthropic.com>
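The failure mode and fix described above can be sketched in plain Python. This is a hypothetical, self-contained illustration, not xarray source: `MultiDimIndex`, `filter_coords`, and the dims-tuple arguments are simplifications of the real `Index.should_add_coord_to_array(name, var, dims)` hook, which receives a `Variable` rather than a tuple of dims.

```python
# Hypothetical sketch of the coordinate filtering fixed in this PR.
# A custom index spanning both "nodes" and "faces" should keep all of
# its coordinates whenever any of its dimensions survive on the result.

class MultiDimIndex:
    """Simplified stand-in for a multi-coordinate xarray Index."""

    def __init__(self, coord_dims):
        self.coord_dims = coord_dims  # coord name -> tuple of dims
        self.all_dims = {d for dims in coord_dims.values() for d in dims}

    def should_add_coord_to_array(self, name, var_dims, array_dims):
        # Keep every coordinate of this index as long as any of the
        # index's dimensions are present on the target array.
        return bool(self.all_dims & set(array_dims))


def filter_coords(coords, indexes, dims):
    """Mimics the fixed logic: consult the index for index-backed
    coordinates, fall back to the strict dim-subset test otherwise."""
    kept = {}
    for name, var_dims in coords.items():
        if name in indexes:
            if indexes[name].should_add_coord_to_array(name, var_dims, dims):
                kept[name] = var_dims
        elif set(var_dims) <= dims:
            kept[name] = var_dims
    return kept


idx = MultiDimIndex({"node_x": ("nodes",), "face_x": ("faces",)})
coords = {"node_x": ("nodes",), "face_x": ("faces",), "plain": ("time",)}
indexes = {"node_x": idx, "face_x": idx}

# Old behavior: the strict subset test drops face_x (and the index with it).
old = {k: v for k, v in coords.items() if set(v) <= {"nodes"}}
assert old == {"node_x": ("nodes",)}

# Fixed behavior: the index keeps both of its coordinates.
assert filter_coords(coords, indexes, {"nodes"}) == {
    "node_x": ("nodes",),
    "face_x": ("faces",),
}
```

Non-indexed coordinates (like `plain` above) still follow the old strict-subset rule; only index-backed coordinates defer to the index.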
@benbovy, one more thing. I asked Claude if it used your suggestion in #11215 (comment) and it said:
- Cast OrderedSet to set when calling should_add_coord_to_array to satisfy the expected set[Hashable] type annotation
- Replace dict comprehensions with dict.fromkeys (ruff C416)
- Reformat assert messages to ruff-preferred style

Co-authored-by: Claude <noreply@anthropic.com>
benbovy left a comment:

Thanks @rsignell! That looks good to me.
For the case of Dataset.to_dataarray(), I think it should work similarly to DataArray._replace_maybe_drop_dims():
```python
coords = {}
for k, v in self.coords.items():
    if k in self._indexes:
        if self._indexes[k].should_add_coord_to_array(k, v, dims):
            coords[k] = v
    elif set(v.dims) <= dims:
        coords[k] = v
```

Review comment on:

```python
assert isinstance(next(iter(reduced_node.xindexes.values())), MultiDimIndex), (
    "Index dropped from node DataArray after reduction"
)
assert isinstance(next(iter(reduced_face.xindexes.values())), MultiDimIndex), (
    "Index dropped from face DataArray after reduction"
)
```
Suggested change:

```diff
-assert isinstance(next(iter(reduced_node.xindexes.values())), MultiDimIndex), (
-    "Index dropped from node DataArray after reduction"
-)
-assert isinstance(next(iter(reduced_face.xindexes.values())), MultiDimIndex), (
-    "Index dropped from face DataArray after reduction"
-)
+for da in [reduced_node, reduced_face]:
+    for name in ["node_x", "node_y", "face_x", "face_y"]:
+        assert name in da.coords
+        assert isinstance(ds.xindexes[name], MultiDimIndex)
```
The isinstance(next(iter(...))) checks should be enough, but this suggestion is to make the expected behavior clearer to the reader (or LLM).
Review comment on:

```python
assert isinstance(next(iter(node_subset.xindexes.values())), MultiDimIndex), (
    "Index dropped from Dataset when subsetting to node variable"
)
assert isinstance(next(iter(face_subset.xindexes.values())), MultiDimIndex), (
    "Index dropped from Dataset when subsetting to face variable"
)
```
Thanks for the review @benbovy! Here is the Claude Code plan to address your comments -- let me know if you would like changes!

Planned changes:

1. Fix
At first glance the proposed changes look good!
- Apply should_add_coord_to_array() in Dataset.to_dataarray() for consistent handling of multi-coordinate indexes (per benbovy review)
- Add test_to_dataarray_preserves_multi_coord_index to cover this path
- Simplify isinstance(next(iter(...))) checks to cleaner for-loops in both test_dataarray.py and test_dataset.py (per benbovy suggestion)

Co-authored-by: Claude <noreply@anthropic.com>
@benbovy, okay, Claude thinks it addressed the comments. Fingers crossed! 😄
Review comment on:

```python
variable = Variable(dims, data, self.attrs, fastpath=True)
coords = {k: v.variable for k, v in self.coords.items()}
broadcast_dims = set(broadcast_vars[0].dims)
```
Actually, a fix is not even necessary here, since Dataset variables are broadcast against each other and thus all coordinates are kept in the returned DataArray anyway, as mentioned in the docstrings. I added to_dataarray() to the list as I naively looked at all places where filter_indexes_from_coords is called, sorry.

(filter_indexes_from_coords is not even needed; copying self._indexes should be enough.)
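The reviewer's point can be checked with a toy sketch (hypothetical names, not xarray source): because to_dataarray() broadcasts all Dataset variables against each other, every coordinate's dims end up as a subset of the result's dims, so the subset filter keeps everything and plain dict copies are equivalent.

```python
# Hypothetical sketch: after broadcasting all Dataset variables, the
# result's dims cover every coordinate's dims, so filtering coords (or
# indexes) is a no-op and plain dict copies suffice.

def union_dims(variables):
    """Ordered union of dims, mimicking broadcasting over all variables."""
    dims = []
    for var_dims in variables.values():
        for d in var_dims:
            if d not in dims:
                dims.append(d)
    return tuple(dims)


variables = {"node_data": ("nodes",), "face_data": ("faces",)}
coords = {"node_x": ("nodes",), "face_x": ("faces",)}

dims = union_dims(variables)
kept = {k: v for k, v in coords.items() if set(v) <= set(dims)}
assert dims == ("nodes", "faces")
assert kept == coords  # the filter keeps everything; dict(coords) suffices
```

This only holds because coordinates in a Dataset can only use dimensions that exist on the dataset, all of which appear on the broadcast result.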
Per reviewer feedback, no fix was needed in to_dataarray(), since Dataset variables are broadcast and all coordinates are kept anyway. Replace the should_add_coord_to_array loop and filter_indexes_from_coords call with a simple dict copy of all coords and indexes.

Co-authored-by: Claude <noreply@anthropic.com>
@benbovy, I asked Claude to double-check if the issue was addressed:
LGTM too, thanks!



@benbovy, after some discussion with @Huite, he thought Claude Code might be capable of fixing this issue, so I gave it a shot. I'm only an appreciative user of the xarray codebase, so I don't really know how to test/evaluate this beyond the passing tests that Claude itself came up with. So I hope this isn't a waste of your time. With that said, here's what Claude came up with:
When a custom Index spans multiple coordinates across different dimensions (e.g. a UGRID topology index covering both `nodes` and `faces`), it was incorrectly dropped during:

- DataArray dimension reduction (e.g. `.mean("extra_dim")`) via `_replace_maybe_drop_dims`, which filtered coords by strict dim subset without consulting `Index.should_add_coord_to_array()`
- Dataset subsetting by variable names (e.g. `ds[["node_data"]]`) via `_copy_listed`, with the same issue

Both paths now use `should_add_coord_to_array()` for index-backed coordinates, consistent with how `_construct_dataarray` already works.

Description
Checklist
AI Disclosure
Tools: Claude Code